Abstract. To maximize its success, an AGI typically needs to explore 
its initially unknown world. Is there an optimal way of doing so? Here 
we derive an affirmative answer for a broad class of environments. 

1 Introduction 

An intelligent agent is sent to explore an unknown environment. Over the course 
of its mission, the agent makes observations, carries out actions, and incremen- 
tally builds up a model of the environment from this interaction. Since the way 
in which the agent selects actions may greatly affect the efficiency of the explo- 
ration, the following question naturally arises: 

How should the agent choose the actions such that the knowledge about 
the environment accumulates as quickly as possible? 

In this paper, this question is addressed under a classical framework, in which 
the agent improves its model of the environment through probabilistic inference, 
and learning progress is measured in terms of Shannon information gain. We 
show that the agent can, at least in principle, optimally choose actions based 
on previous experiences, such that the cumulative expected information gain is 
maximized. We then consider a special case, namely exploration in finite MDPs, 
where we demonstrate, both in theory and through experiment, that the opti- 
mal Bayesian exploration strategy can be effectively approximated by solving a 
sequence of dynamic programming problems. 

The rest of the paper is organized as follows: Section 2 reviews the basic 
concepts and establishes the terminology; Section 3 elaborates the principle of 
optimal Bayesian exploration; Section 4 focuses on exploration in finite MDPs; 
Section 5 presents a simple experiment; Section 6 briefly reviews related work; 
and Section 7 concludes the paper. 

2 Preliminaries 

Suppose that the agent interacts with the environment in discrete time cycles 
$t = 1, 2, \ldots$. In each cycle, the agent performs an action $a$, then receives a 
sensory input $o$. A history $h$ is either the empty string $\emptyset$ or a string of the 
form $a_1 o_1 \cdots a_t o_t$ for some $t$, and $ha$ and $hao$ refer to the strings resulting from 
appending $a$ and $ao$ to $h$, respectively. 



2.1 Learning from Sequential Interactions 



To facilitate the subsequent discussion under a probabilistic framework, we make 
the following assumptions: 

Assumption I. The models of the environment under consideration are fully 
described by a random element $\Theta$ which depends solely on the environment. 
Moreover, the agent's initial knowledge about $\Theta$ is summarized by a prior 
density $p(\theta)$. 

Assumption II. The agent is equipped with a conditional predictor $p(o|ha;\theta)$, 
i.e., the agent is capable of refining its prediction in the light of information 
about $\Theta$. 

Using $p(\theta)$ and $p(o|ha;\theta)$ as building blocks, it is straightforward to formulate 
learning in terms of probabilistic inference. From Assumption I, given the 
history $h$, the agent's knowledge about $\Theta$ is fully summarized by $p(\theta|h)$. According 
to Bayes' rule, $p(\theta|hao) = \frac{p(\theta|ha)\, p(o|ha;\theta)}{p(o|ha)}$, with $p(o|ha) = \int p(o|ha;\theta)\, p(\theta|h)\, d\theta$. 
The term $p(\theta|ha)$ represents the agent's current knowledge about $\Theta$ given history 
$h$ and an additional action $a$. Since $\Theta$ depends solely on the environment, and, 
importantly, knowing the action without subsequent observations cannot change 
the agent's state of knowledge about $\Theta$, $p(\theta|ha) = p(\theta|h)$, hence the knowledge 
about $\Theta$ can be updated using 

\[ p(\theta|hao) = p(\theta|h)\, \frac{p(o|ha;\theta)}{p(o|ha)}. \tag{1} \]

It is worth pointing out that $p(o|ha;\theta)$ is chosen a priori. It is not required 
that it match the true dynamics of the environment, but the effectiveness 
of the learning certainly depends on the choice of $p(o|ha;\theta)$. For example, if 
$\Theta \in \mathbb{R}$, and $p(o|ha;\theta)$ depends on $\theta$ only through its sign, then no knowledge 
other than the sign of $\Theta$ can be learned. 
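
The update in Eq. 1 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that $\Theta$ takes values in a small finite set (so the integral becomes a sum) and that the environment is a coin with unknown bias; the names `bayes_update` and `coin` are hypothetical and not part of the paper.

```python
def bayes_update(prior, likelihood, o):
    """Return p(theta | hao) from p(theta | h), following Eq. 1.

    prior:      dict mapping theta -> p(theta | h)
    likelihood: function (o, theta) -> p(o | ha; theta)
    """
    # Predictive probability p(o | ha) = sum_theta p(o | ha; theta) p(theta | h)
    evidence = sum(likelihood(o, th) * p for th, p in prior.items())
    return {th: p * likelihood(o, th) / evidence for th, p in prior.items()}


def coin(o, theta):
    """Assumed predictor: o = 1 (heads) with probability theta, else o = 0."""
    return theta if o == 1 else 1.0 - theta


# Hypothetical prior: the bias is one of three values, equally likely.
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

posterior = bayes_update(prior, coin, 1)       # observe one head
posterior = bayes_update(posterior, coin, 1)   # observe a second head
```

After two heads the posterior mass shifts toward the largest bias, as one would expect from Eq. 1.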



2.2 Information Gain as Learning Progress 

Let $h$ and $h'$ be two histories such that $h$ is a prefix of $h'$. The respective 
posteriors of $\Theta$ are $p(\theta|h)$ and $p(\theta|h')$. Using $h$ as a reference point, the amount of 
information gained when the history grows to $h'$ can be measured using the KL 
divergence between $p(\theta|h')$ and $p(\theta|h)$. This information gain from $h$ to $h'$ is 
defined as 

\[ g(h'\|h) = \mathrm{KL}\left(p(\theta|h') \,\|\, p(\theta|h)\right) = \int p(\theta|h') \log \frac{p(\theta|h')}{p(\theta|h)}\, d\theta. \]

As a special case, if $h = \emptyset$, then $g(h') = g(h'\|\emptyset)$ is the cumulative information 
gain with respect to the prior $p(\theta)$. We also write $g(ao\|h)$ for $g(hao\|h)$, which 
denotes the information gained from an additional action-observation pair. 

From an information theoretic point of view, the KL divergence between two 
distributions $p$ and $q$ represents the additional number of bits required to encode 
elements sampled from $p$, using an optimal coding strategy designed for $q$. This can 
be interpreted as the degree of 'unexpectedness' or 'surprise' caused by observing 
samples from $p$ when expecting samples from $q$. 

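The key property of information gain for the treatment below is the following 
decomposition: Let $h$ be a prefix of $h'$ and $h'$ be a prefix of $h''$, then 

\begin{align*}
\mathbb{E}_{h''|h'}\, g(h''\|h)
&= \mathbb{E}_{h''|h'} \int p(\theta|h'') \log \frac{p(\theta|h'')}{p(\theta|h)}\, d\theta \\
&= \mathbb{E}_{h''|h'} \int p(\theta|h'') \left[ \log \frac{p(\theta|h'')}{p(\theta|h')} + \log \frac{p(\theta|h')}{p(\theta|h)} \right] d\theta \\
&= \mathbb{E}_{h''|h'} \int p(\theta|h'') \log \frac{p(\theta|h'')}{p(\theta|h')}\, d\theta + \int \mathbb{E}_{h''|h'}\, p(\theta|h'') \log \frac{p(\theta|h')}{p(\theta|h)}\, d\theta \\
&= \mathbb{E}_{h''|h'}\, g(h''\|h') + \int \mathbb{E}_{h''|h'}\, p(\theta|h'') \log \frac{p(\theta|h')}{p(\theta|h)}\, d\theta.
\end{align*}

From the updating formula Eq. 1, 

\begin{align*}
\sum_o p(o|ha)\, p(\theta|hao) &= \sum_o p(\theta|h)\, p(o|ha;\theta) \\
&= p(\theta|h) \sum_o p(o|ha;\theta) \\
&= p(\theta|h).
\end{align*}

Using this relation recursively, 

\begin{align*}
\mathbb{E}_{h''|h'}\, p(\theta|h'') &= \mathbb{E}_{a_1} \mathbb{E}_{o_1} \cdots \mathbb{E}_{a_t} \mathbb{E}_{o_t}\, p(\theta|h'a_1o_1 \cdots a_to_t) \\
&= \mathbb{E}_{a_1} \mathbb{E}_{o_1} \cdots \mathbb{E}_{a_{t-1}} \mathbb{E}_{o_{t-1}}\, p(\theta|h'a_1o_1 \cdots a_{t-1}o_{t-1}) \\
&= \cdots = p(\theta|h'),
\end{align*}

therefore 

\[ \mathbb{E}_{h''|h'}\, g(h''\|h) = g(h'\|h) + \mathbb{E}_{h''|h'}\, g(h''\|h'). \tag{2} \]

That is, the information gain is additive in expectation. 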
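
As a sanity check on Eq. 2, the following Python sketch compares both sides numerically. It assumes a coin with bias drawn from a small finite set, with a single action ("flip") so that no policy is involved; the helper names and the particular prior are illustrative assumptions, not taken from the paper.

```python
import math


def update(post, o):
    """One Bayes step (Eq. 1) for a coin whose bias theta lies in a finite set."""
    like = {th: (th if o == 1 else 1.0 - th) for th in post}
    z = sum(post[th] * like[th] for th in post)
    return {th: post[th] * like[th] / z for th in post}


def predictive(post, o):
    """p(o | h): probability of observation o under the current posterior."""
    return sum(p * (th if o == 1 else 1.0 - th) for th, p in post.items())


def kl(p, q):
    """KL(p || q) in nats for two distributions over the same finite set."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


prior = {0.1: 0.25, 0.4: 0.25, 0.6: 0.25, 0.9: 0.25}  # p(theta | h), h empty
post_h1 = update(prior, 1)                             # p(theta | h'), h' = one head

# Left side of Eq. 2: E_{h''|h'} g(h''||h), where h'' extends h' by one flip.
lhs = sum(predictive(post_h1, o) * kl(update(post_h1, o), prior) for o in (0, 1))

# Right side of Eq. 2: g(h'||h) + E_{h''|h'} g(h''||h').
rhs = kl(post_h1, prior) + sum(
    predictive(post_h1, o) * kl(update(post_h1, o), post_h1) for o in (0, 1)
)
```

The two quantities agree up to floating-point error, whereas the individual (non-averaged) gains need not add up.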

Having defined the information gain from trajectories ending with observations, 
one may proceed to define the expected information gain of performing 
action $a$, before observing the outcome $o$. Formally, the expected information 
gain of performing $a$ with respect to the current history $h$ is given by 
$g(a\|h) = \mathbb{E}_{o|ha}\, g(ao\|h)$. A simple derivation gives 

\begin{align*}
g(a\|h) &= \sum_o p(o|ha) \int p(\theta|hao) \log \frac{p(\theta|hao)}{p(\theta|h)}\, d\theta \\
&= \sum_o \int p(o,\theta|ha) \log \frac{p(o,\theta|ha)}{p(\theta|ha)\, p(o|ha)}\, d\theta \\
&= I(O;\Theta|ha),
\end{align*}

which means that $g(a\|h)$ is the mutual information between $\Theta$ and the random 
variable $O$ representing the unknown observation, conditioned on the history $h$ 
and action $a$.¹ 
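
The identity between the expected KL divergence and the mutual information can be checked numerically. The sketch below assumes a finite parameter set and a finite observation set; the function names and the two-hypothesis coin are hypothetical choices made only for illustration.

```python
import math


def entropy(ps):
    """Shannon entropy in nats of a finite distribution given as a list."""
    return -sum(p * math.log(p) for p in ps if p > 0)


def expected_gain_kl(post, likelihood, observations):
    """g(a||h) computed as the expected KL divergence between posterior and prior."""
    total = 0.0
    for o in observations:
        p_o = sum(p * likelihood(o, th) for th, p in post.items())
        new = {th: p * likelihood(o, th) / p_o for th, p in post.items()}
        total += p_o * sum(
            new[th] * math.log(new[th] / post[th]) for th in post if new[th] > 0
        )
    return total


def expected_gain_mi(post, likelihood, observations):
    """g(a||h) computed as I(O; Theta | ha) = H(O | ha) - E_theta H(O | ha; theta)."""
    marginal = [
        sum(p * likelihood(o, th) for th, p in post.items()) for o in observations
    ]
    conditional = sum(
        p * entropy([likelihood(o, th) for o in observations])
        for th, p in post.items()
    )
    return entropy(marginal) - conditional


def coin(o, theta):
    """Assumed predictor: heads (o = 1) with probability theta."""
    return theta if o == 1 else 1.0 - theta


post = {0.2: 0.5, 0.8: 0.5}   # hypothetical current belief p(theta | h)
g_kl = expected_gain_kl(post, coin, (0, 1))
g_mi = expected_gain_mi(post, coin, (0, 1))
```

Both routes give the same value, which is the quantity the planner treats as the immediate expected gain.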



3 Optimal Bayesian Exploration 

In this section, the general principle of optimal Bayesian exploration in dynamic 
environments is presented. We first give results obtained by assuming a fixed 
limited life span for our agent, then discuss a condition required to extend this 
to infinite time horizons. 



3.1 Results for Finite Time Horizon 

Suppose that the agent has experienced history $h$, and is about to choose $\tau$ 
more actions in the future. Let $\pi$ be a policy mapping the set of histories to the 
set of actions, such that the agent performs $a$ with probability $\pi(a|h)$ given $h$. 
Define the curiosity Q-value $q^\tau_\pi(h,a)$ as the expected information gained from 
the additional $\tau$ actions, assuming that the agent performs $a$ in the next step 
and follows policy $\pi$ in the remaining $\tau - 1$ steps. Formally, for $\tau = 1$, 

\[ q^1_\pi(h,a) = \mathbb{E}_{o|ha}\, g(ao\|h) = g(a\|h), \]

and for $\tau > 1$, 

\begin{align*}
q^\tau_\pi(h,a) &= \mathbb{E}_{o|ha} \mathbb{E}_{a_1|hao} \mathbb{E}_{o_1|haoa_1} \cdots \mathbb{E}_{o_{\tau-1}|h \cdots a_{\tau-1}}\, g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|h) \\
&= \mathbb{E}_{o|ha} \mathbb{E}_{a_1o_1 \cdots a_{\tau-1}o_{\tau-1}|hao}\, g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|h).
\end{align*}

¹ Side note: To generalize the discussion, concepts from algorithmic information theory, 
such as compression distance, may also be used here. However, restricting the 
discussion to a probabilistic framework greatly simplifies the matter. 



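The curiosity Q-value can be defined recursively. Applying Eq. 2 for $\tau = 2$, 

\begin{align*}
q^2_\pi(h,a) &= \mathbb{E}_{o|ha} \mathbb{E}_{a_1o_1|hao}\, g(haoa_1o_1\|h) \\
&= \mathbb{E}_{o|ha} \left[ g(ao\|h) + \mathbb{E}_{a_1o_1|hao}\, g(a_1o_1\|hao) \right] \\
&= g(a\|h) + \mathbb{E}_{o|ha} \mathbb{E}_{a'|hao}\, q^1_\pi(hao,a').
\end{align*}

And for $\tau > 2$, 

\begin{align*}
q^\tau_\pi(h,a) &= \mathbb{E}_{o|ha} \mathbb{E}_{a_1o_1 \cdots a_{\tau-1}o_{\tau-1}|hao}\, g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|h) \\
&= \mathbb{E}_{o|ha} \left[ g(ao\|h) + \mathbb{E}_{a_1o_1 \cdots a_{\tau-1}o_{\tau-1}|hao}\, g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|hao) \right] \\
&= g(a\|h) + \mathbb{E}_{o|ha} \mathbb{E}_{a'|hao}\, q^{\tau-1}_\pi(hao,a'). \tag{3}
\end{align*}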

Noting that Eq. 3 bears great resemblance to the definition of state-action values 
($Q(s,a)$) in reinforcement learning, one can similarly define the curiosity value 
of a particular history as $v^\tau_\pi(h) = \mathbb{E}_{a|h}\, q^\tau_\pi(h,a)$, analogous to state values ($V(s)$), 
which can also be iteratively defined as $v^1_\pi(h) = \mathbb{E}_{a|h}\, g(a\|h)$, and 

\[ v^\tau_\pi(h) = \mathbb{E}_{a|h} \left[ g(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}_\pi(hao) \right]. \]

The curiosity value $v^\tau_\pi(h)$ is the expected information gain of performing the 
additional $\tau$ steps, assuming that the agent follows policy $\pi$. The two notations 
can be combined to write 

\[ q^\tau_\pi(h,a) = g(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}_\pi(hao). \tag{4} \]



This equation has an interesting interpretation: since the agent is operating 
in a dynamic environment, it has to take into account not only the immediate 
expected information gain of performing the current action, i.e., $g(a\|h)$, but also 
the expected curiosity value of the situation in which the agent ends up due to 
the action, i.e., $v^{\tau-1}_\pi(hao)$. As a consequence, the agent needs to choose actions 
that balance the two factors in order to improve its total expected information 
gain. 

Now we show that there is an optimal policy $\pi_*$, which leads to the maximum 
cumulative expected information gain given any history $h$. To obtain the optimal 
policy, one may work backwards in $\tau$, taking greedy actions with respect to the 
curiosity Q-values at each time step. Namely, for $\tau = 1$, let 

\[ q^1(h,a) = g(a\|h), \quad \pi^1_*(h) = \arg\max_a g(a\|h), \quad \text{and} \quad v^1(h) = \max_a g(a\|h), \]

such that $v^1(h) = q^1(h, \pi^1_*(h))$, and for $\tau > 1$, let 

\[ q^\tau(h,a) = g(a\|h) + \mathbb{E}_{o|ha} \max_{a'} q^{\tau-1}(hao,a') = g(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}(hao), \]

with $\pi^\tau_*(h) = \arg\max_a q^\tau(h,a)$ and $v^\tau(h) = \max_a q^\tau(h,a)$. We show that 
$\pi^\tau_*(h)$ is indeed the optimal policy for any given $\tau$ and $h$ in the sense that 
the curiosity value, when following $\pi^\tau_*$, is maximized. To see this, take any other 
strategy $\pi$, and first notice that 

\[ v^1(h) = \max_a g(a\|h) \ge \mathbb{E}_{a|h}\, g(a\|h) = v^1_\pi(h). \]

Moreover, assuming $v^\tau(h) \ge v^\tau_\pi(h)$, 

\begin{align*}
v^{\tau+1}(h) &= \max_a \left[ g(a\|h) + \mathbb{E}_{o|ha}\, v^\tau(hao) \right] \ge \max_a \left[ g(a\|h) + \mathbb{E}_{o|ha}\, v^\tau_\pi(hao) \right] \\
&\ge \mathbb{E}_{a|h} \left[ g(a\|h) + \mathbb{E}_{o|ha}\, v^\tau_\pi(hao) \right] = v^{\tau+1}_\pi(h).
\end{align*}

Therefore $v^\tau(h) \ge v^\tau_\pi(h)$ holds for arbitrary $\tau$, $h$, and $\pi$. The same can be shown 
for curiosity Q-values, namely, $q^\tau(h,a) \ge q^\tau_\pi(h,a)$, for all $\tau$, $h$, $a$, and $\pi$. It may 
be beneficial to write $q^\tau$ in explicit form, namely, 

\[ q^\tau(h,a) = \mathbb{E}_{o|ha} \max_{a_1} \mathbb{E}_{o_1|haoa_1} \cdots \max_{a_{\tau-1}} \mathbb{E}_{o_{\tau-1}|h \cdots a_{\tau-1}}\, g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|h). \]

Now consider that the agent has a fixed life span $T$. It can be seen that at time 
$t$, the agent has to perform $\pi^{T-t}_*(h_{t-1})$ to maximize the expected information 
gain in the remaining $T - t$ steps. Here $h_{t-1} = a_1o_1 \cdots a_{t-1}o_{t-1}$ is the history 
at time $t$. However, from Eq. 2, 

\[ \mathbb{E}_{h_T|h_{t-1}}\, g(h_T) = g(h_{t-1}) + \mathbb{E}_{h_T|h_{t-1}}\, g(h_T\|h_{t-1}). \]

Note that at time $t$, $g(h_{t-1})$ is a constant, thus maximizing the cumulative 
expected information gain in the remaining time steps is equivalent to maximizing 
the expected information gain of the whole trajectory with respect to the prior. 
The result is summarized in the following proposition: 

Proposition 1. Let $q^1(h,a) = g(a\|h)$, $v^1(h) = \max_a q^1(h,a)$, and 

\[ q^\tau(h,a) = g(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}(hao), \quad v^\tau(h) = \max_a q^\tau(h,a), \]

then the policy $\pi^\tau_*(h) = \arg\max_a q^\tau(h,a)$ is optimal in the sense that $v^\tau(h) \ge 
v^\tau_\pi(h)$, $q^\tau(h,a) \ge q^\tau_\pi(h,a)$ for any $\pi$, $\tau$, $h$ and $a$. In particular, for an agent 
with fixed life span $T$, following $\pi^{T-t}_*(h_{t-1})$ at time $t = 1, \ldots, T$ is optimal in 
the sense that the expected cumulative information gain with respect to the prior 
is maximized. 

3.2 Non-triviality of the Result 

Intuitively, the interpretation of the recursive definition of the curiosity (Q-)value 
is simple, and bears clear resemblance to its counterpart in reinforcement 
learning. It might be tempting to think that the result is nothing more 
than solving the finite horizon reinforcement learning problem using $g(a\|h)$ or 
$g(ao\|h)$ as the reward signals. However, this is not the case. 



First, note that the decomposition Eq. 2 is a direct consequence of the formulation 
of the KL divergence. The decomposition does not necessarily hold if 
$g(h)$ is replaced with other types of measures of information gain. 

Second, it is worth pointing out that $g(ao\|h)$ and $g(a\|h)$ behave differently 
from normal reward signals in the sense that they are additive only in expectation, 
while in the reinforcement learning setup, the reward signals are usually assumed 
to be additive, i.e., adding reward signals together is always meaningful. Consider 
a simple problem with only two actions. If $g(ao\|h)$ is a plain reward function, 
then $g(ao\|h) + g(a'o'\|hao)$ should be meaningful, no matter whether $a$ and $o$ are known 
or not. But this is not the case, since the sum does not have a valid information 
theoretic interpretation. On the other hand, the sum is meaningful in expectation. 
Namely, when $o'$ has not been observed, from Eq. 2, 

\[ g(ao\|h) + \mathbb{E}_{o'|haoa'}\, g(a'o'\|hao) = \mathbb{E}_{o'|haoa'}\, g(aoa'o'\|h), \]

the sum can be interpreted as the expectation of the information gained from $h$ 
to $haoa'o'$. This result shows that $g(a\|h)$ or $g(ao\|h)$ can be treated as additive 
reward signals only when one is planning ahead. 

To emphasize the difference further, note that all immediate information 
gains $g(ao\|h)$ are non-negative since they are essentially KL divergences. A natural 
assumption would be that the information gain $g(h)$, which is the sum of 
all $g(ao\|h)$ in expectation, grows monotonically when the length of the history 
increases. However, this is not the case; see Figure 1 for an example. Although 
$g(ao\|h)$ is always non-negative, some of the gain may pull the posterior $p(\theta|h)$ back 
toward the prior density $p(\theta)$, resulting in a decrease of the KL divergence between 
$p(\theta|h)$ and $p(\theta)$. This is never the case if one considers the normal reward signals in 
reinforcement learning, where the accumulated reward would never decrease if all 
rewards are non-negative. 



3.3 The Algorithm 



The definition of the optimal exploration policy is constructive, which means 
that it can be readily implemented, provided that the number of actions and 
possible observations is finite so that the expectation and maximization can be 
computed exactly. 

The following two algorithms compute the maximum curiosity value $v^\tau(h)$ 
and the maximum curiosity Q-value $q^\tau(h,a)$, respectively, assuming that the 
expected immediate gain $g(a\|h)$ can be computed. 
Fig. 1. Illustration of the difference between the sum of one-step information 
gain and the cumulative information gain with respect to the prior. In this 
case, 1000 independent samples are generated from a distribution over finite 
sample space $\{1,2,3\}$, with $p(x = 1) = 0.1$, $p(x = 2) = 0.5$, and $p(x = 3) = 0.4$. 
The task of learning is to recover the mass function from the samples, 
assuming a Dirichlet prior Dir(^,^,^). The KL divergence between two 
Dirichlet distributions is computed according to [B]. It is clear from the graph 
that the cumulative information gain fluctuates when the number of samples 
increases, while the sum of the one-step information gain increases monotonically. 
It also shows that the difference between the two quantities can be large. 



CuriosityValue($h$, $\tau$) 

Input: history $h$, look-ahead $\tau$ 
Output: curiosity value $v^\tau(h)$ 

$v \leftarrow 0$ 
for all possible $a$ 
    $v \leftarrow \max(v, \text{CuriosityQValue}(h, a, \tau))$ 
end for 
return $v$ 


CuriosityQValue($h$, $a$, $\tau$) 

Input: history $h$, action $a$, look-ahead $\tau$ 
Output: curiosity Q-value $q^\tau(h,a)$ 

$q \leftarrow g(a\|h)$ 
if $\tau > 1$ 
    for all possible $o$ 
        $q \leftarrow q + p(o|ha) \cdot \text{CuriosityValue}(hao, \tau - 1)$ 
    end for 
end if 
return $q$ 



The complexity of both CuriosityValue and CuriosityQValue is $O((n_o n_a)^\tau)$, 
where $n_o$ and $n_a$ are the number of possible observations and actions, respectively. 
Since the cost is exponential in $\tau$, planning with a large number of look-ahead 
steps is infeasible, and approximation heuristics must be used in practice. 
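
The two mutually recursive procedures can be sketched in Python as follows. This is only an illustration under stated assumptions: the parameter set is finite, the predictor `lik(o, a, theta)` does not depend on the history, and the two-armed example (one arm known to be fair, one arm with unknown bias) is hypothetical. None of the function names come from the paper.

```python
import math


def posterior(prior, lik, h):
    """p(theta | h) for a history h = ((a1, o1), ..., (at, ot))."""
    w = {th: p * math.prod(lik(o, a, th) for a, o in h) for th, p in prior.items()}
    z = sum(w.values())
    return {th: x / z for th, x in w.items()}


def gain(prior, lik, h, a, obs):
    """Expected one-step information gain g(a||h)."""
    post = posterior(prior, lik, h)
    g = 0.0
    for o in obs:
        p_o = sum(p * lik(o, a, th) for th, p in post.items())
        if p_o == 0.0:
            continue
        for th, p in post.items():
            new = p * lik(o, a, th) / p_o
            if new > 0:
                g += p_o * new * math.log(new / p)
    return g


def curiosity_value(prior, lik, h, tau, actions, obs):
    """v^tau(h): maximum over actions of the curiosity Q-value."""
    return max(curiosity_q_value(prior, lik, h, a, tau, actions, obs) for a in actions)


def curiosity_q_value(prior, lik, h, a, tau, actions, obs):
    """q^tau(h, a) = g(a||h) + E_{o|ha} v^{tau-1}(hao)."""
    q = gain(prior, lik, h, a, obs)
    if tau > 1:
        post = posterior(prior, lik, h)
        for o in obs:
            p_o = sum(p * lik(o, a, th) for th, p in post.items())
            if p_o > 0:
                q += p_o * curiosity_value(
                    prior, lik, h + ((a, o),), tau - 1, actions, obs)
    return q


def lik(o, a, theta):
    """Assumed model: theta = (bias of arm 0, bias of arm 1); o = 1 means heads."""
    return theta[a] if o == 1 else 1.0 - theta[a]


# Hypothetical prior: arm 0 is known to be fair, arm 1 has bias 0.1 or 0.9.
prior = {(0.5, 0.1): 0.5, (0.5, 0.9): 0.5}
actions, obs = (0, 1), (0, 1)

q_arm0 = curiosity_q_value(prior, lik, (), 0, 1, actions, obs)
q_arm1 = curiosity_q_value(prior, lik, (), 1, 1, actions, obs)
v1 = curiosity_value(prior, lik, (), 1, actions, obs)
v2 = curiosity_value(prior, lik, (), 2, actions, obs)
```

In this toy setting pulling the fair arm yields no information, so the planner prefers the uncertain arm, and the curiosity value grows with the look-ahead while staying below the prior entropy $\log 2$.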

3.4 Extending to Infinite Horizon 

Having to restrict the maximum life span of the agent is rather inconvenient. 
It is tempting to define the curiosity Q-value in the infinite time horizon case 
as the limit of curiosity Q-values with increasing life spans, $T \to \infty$. However, 
this cannot be achieved without additional technical constraints. For example, 
consider simple coin tossing. Assuming a Beta(1,1) prior over the probability of seeing 
heads, the expected cumulative information gain for the next $T$ flips is given 
by 

\[ v^T(h_\emptyset) = I(\Theta; X_1, \ldots, X_T) \sim \log T. \]

With increasing $T$, $v^T(h_\emptyset) \to \infty$. A frequently used approach to simplifying 
the math is to introduce a discount factor $\gamma$, as used in reinforcement learning. 
Assume that the agent has at most $\tau$ actions left, but before finishing the 
$\tau$ actions it may be forced to leave the environment with probability $1 - \gamma$ 
($0 \le \gamma < 1$) at each time step. In this case, the curiosity Q-value becomes 
$q^{\gamma,1}_\pi(h,a) = g(a\|h)$, and 

\begin{align*}
q^{\gamma,\tau}_\pi(h,a) &= (1-\gamma)\, g(a\|h) + \gamma \left[ g(a\|h) + \mathbb{E}_{o|ha} \mathbb{E}_{a'|hao}\, q^{\gamma,\tau-1}_\pi(hao,a') \right] \\
&= g(a\|h) + \gamma\, \mathbb{E}_{o|ha} \mathbb{E}_{a'|hao}\, q^{\gamma,\tau-1}_\pi(hao,a').
\end{align*}
One may also interpret $q^{\gamma,\tau}_\pi(h,a)$ as a linear combination of curiosity Q-values 
without the discount, 

\[ q^{\gamma,\tau}_\pi(h,a) = (1-\gamma) \sum_{t=1}^{\tau} \gamma^{t-1} q^t_\pi(h,a) + \gamma^\tau q^\tau_\pi(h,a). \]

Note that curiosity Q-values with larger look-ahead steps are weighed exponentially 
less. 

The optimal policy in the discounted case is given by 

\[ q^{\gamma,1}(h,a) = g(a\|h), \quad v^{\gamma,1}(h) = \max_a q^{\gamma,1}(h,a), \]

and 

\[ q^{\gamma,\tau}(h,a) = g(a\|h) + \gamma\, \mathbb{E}_{o|ha}\, v^{\gamma,\tau-1}(hao), \quad v^{\gamma,\tau}(h) = \max_a q^{\gamma,\tau}(h,a). \]

The optimal actions are given by $\pi^{\gamma,\tau}_*(h) = \arg\max_a q^{\gamma,\tau}(h,a)$. The proof that 
$\pi^{\gamma,\tau}_*$ is optimal is similar to the one for the no-discount case and thus is omitted 
here. 

Adding the discount enables one to define the curiosity Q-value in infinite 
time horizon in a number of cases. However, it is still possible to construct 
scenarios where such discount fails. Consider an infinite list of bandits. For bandit $n$, 



there are n possible outcomes with Dirichlet prior Dir . . . , The expected 
information gain of pulling bandit n for the first time is then given by 

log n - V' (2) + log ^1 + ^ log n. 

Assume at time t, only the first e*^ bandits are available, thus the curiosity Q- 
value in finite time horizon is always finite. However, since the largest expected 
information gain grows at speed e* , for any given 7 > 0, g'^''^ goes to infinity 
with increasing r. This example gives the intuition that to make the curiosity 
Q-value meaningful, the 'total information content' of the environment (or its 
growing speed) must be bounded. 

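The following two lemmas are useful for later discussion. 

Lemma 1. $q^{\gamma,\tau+1}_\pi(h,a) - q^{\gamma,\tau}_\pi(h,a) = \gamma^\tau\, \mathbb{E}_{oa_1 \cdots o_{\tau-1}a_\tau|ha}\, g(a_\tau\|h \cdots o_{\tau-1})$. 

Proof. Expand $q^{\gamma,\tau}_\pi$ and $q^{\gamma,\tau+1}_\pi$, 

\begin{align*}
q^{\gamma,\tau+1}_\pi - q^{\gamma,\tau}_\pi
&= (1-\gamma) \sum_{t=1}^{\tau+1} \gamma^{t-1} q^t_\pi + \gamma^{\tau+1} q^{\tau+1}_\pi - (1-\gamma) \sum_{t=1}^{\tau} \gamma^{t-1} q^t_\pi - \gamma^\tau q^\tau_\pi \\
&= \gamma^\tau \left( q^{\tau+1}_\pi - q^\tau_\pi \right).
\end{align*}

By definition, 

\begin{align*}
q^{\tau+1}_\pi - q^\tau_\pi
&= \mathbb{E}_{o|ha} \mathbb{E}_{a_1o_1 \cdots a_\tau o_\tau|hao}\, g(haoa_1o_1 \cdots a_\tau o_\tau\|h) \\
&\quad - \mathbb{E}_{o|ha} \mathbb{E}_{a_1o_1 \cdots a_{\tau-1}o_{\tau-1}|hao}\, g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|h) \\
&= \mathbb{E}_{o|ha} \mathbb{E}_{a_1o_1 \cdots a_{\tau-1}o_{\tau-1}|hao} \left[ \mathbb{E}_{a_\tau o_\tau|h \cdots o_{\tau-1}}\, g(haoa_1o_1 \cdots a_\tau o_\tau\|h) - g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|h) \right].
\end{align*}

Using Eq. 2, 

\[ \mathbb{E}_{a_\tau o_\tau|h \cdots o_{\tau-1}}\, g(haoa_1o_1 \cdots a_\tau o_\tau\|h) - g(haoa_1o_1 \cdots a_{\tau-1}o_{\tau-1}\|h) = \mathbb{E}_{a_\tau|h \cdots o_{\tau-1}}\, g(a_\tau\|h \cdots o_{\tau-1}), \]

thus 

\[ q^{\tau+1}_\pi - q^\tau_\pi = \mathbb{E}_{o|ha} \mathbb{E}_{a_1o_1 \cdots a_{\tau-1}o_{\tau-1}|hao} \mathbb{E}_{a_\tau|h \cdots o_{\tau-1}}\, g(a_\tau\|h \cdots o_{\tau-1}). \]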



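Lemma 2. $q^{\gamma,\tau+1}(h,a) - q^{\gamma,\tau}(h,a) \le \gamma^\tau\, \mathbb{E}_{o|ha} \max_{a_1} \mathbb{E}_{o_1|haoa_1} \cdots \max_{a_\tau} g(a_\tau\|h \cdots o_{\tau-1})$. 

Proof. Expand $q^{\gamma,\tau}$ and $q^{\gamma,\tau+1}$, and note that $\max X - \max Y \le \max |X - Y|$, 
then 

\begin{align*}
& q^{\gamma,\tau+1}(h,a) - q^{\gamma,\tau}(h,a) \\
&= \mathbb{E}_{o|ha} \max_{a_1} \mathbb{E}_{o_1|haoa_1} \cdots \max_{a_\tau} \left[ g(a\|h) + \gamma g(a_1\|hao) + \cdots + \gamma^\tau g(a_\tau\|h \cdots o_{\tau-1}) \right] \\
&\quad - \mathbb{E}_{o|ha} \max_{a_1} \mathbb{E}_{o_1|haoa_1} \cdots \max_{a_{\tau-1}} \left[ g(a\|h) + \gamma g(a_1\|hao) + \cdots + \gamma^{\tau-1} g(a_{\tau-1}\|h \cdots o_{\tau-2}) \right] \\
&\le \mathbb{E}_{o|ha} \max_{a_1} \Big\{ \mathbb{E}_{o_1|haoa_1} \cdots \max_{a_\tau} \left[ g(a\|h) + \gamma g(a_1\|hao) + \cdots + \gamma^\tau g(a_\tau\|h \cdots o_{\tau-1}) \right] \\
&\quad - \mathbb{E}_{o_1|haoa_1} \cdots \max_{a_{\tau-1}} \left[ g(a\|h) + \gamma g(a_1\|hao) + \cdots + \gamma^{\tau-1} g(a_{\tau-1}\|h \cdots o_{\tau-2}) \right] \Big\} \\
&\le \cdots \\
&\le \gamma^\tau\, \mathbb{E}_{o|ha} \max_{a_1} \mathbb{E}_{o_1|haoa_1} \cdots \max_{a_\tau} g(a_\tau\|h \cdots o_{\tau-1}).
\end{align*}

It can be seen that if $\mathbb{E}_{oa_1 \cdots o_{\tau-1}a_\tau|ha}\, g(a_\tau\|h \cdots o_{\tau-1})$ is bounded, then both 
$q^{\gamma,\tau}_\pi$ and $q^{\gamma,\tau}$ are Cauchy sequences with respect to $\tau$. 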



4 Exploration in Finite Markovian Environment with 
Dirichlet Priors 



In this section we restrict the discussion to a simple case, where the possible 
actions and sensory inputs are finite, and the agent assumes that the environment 
is Markovian, namely, the current sensory input and action are sufficient 
for determining the (probabilities of the) next sensory inputs. Formally, let 
$\mathcal{S} = \{1, \ldots, S\}$ and $\mathcal{A} = \{1, \ldots, A\}$ be the spaces of possible sensory inputs, 
which we refer to as 'states', and actions. The dynamics of the environment 
is fully determined by the transition probability $p(s'|s,a)$. 

The agent tries to learn the transition probabilities. Initially, it assumes for 
each $(s,a)$ a Dirichlet prior over the random variable $\Theta_{s,a}$ corresponding to the 
distribution $p(\cdot|s,a)$. Through time, the agent observes the transitions when performing 
$a$ at $s$, and updates its estimate of $\Theta_{s,a}$ through probabilistic inference. 
Since the Dirichlet distribution is conjugate to the multinomial distribution, the 
posterior is still a Dirichlet distribution over $\Theta_{s,a}$. Therefore, at any time, the 
agent's knowledge about the environment can be fully summarized by a three-dimensional 
array $\alpha(s,a,s')$, such that $\mathrm{Dir}(\alpha_{s,a,1}, \ldots, \alpha_{s,a,S})$ is the current (prior 
or posterior) density of $\Theta_{s,a}$, and the definition of the optimal curiosity Q-value 
can be written as² 

\[ q^{\gamma,\tau}_\alpha(s,a) = g(\alpha_{s,a}) + \gamma \sum_{s'} p_\alpha(s'|s,a) \max_{a'} q^{\gamma,\tau-1}_{\alpha \triangleleft (s,a,s')}(s',a'). \]

Here $g(\alpha_{s,a})$ is the expected immediate information gain for $\Theta_{s,a}$ given the 
current parameterization of the Dirichlet distribution. By definition, $g(\alpha_{s,a})$ is 
also the mutual information between $\Theta_{s,a}$ and an additional observation. According 
to [S], the precise form of $g$ is given by 

\[ g([n_1, \ldots, n_S]) = \log n - \psi(n+1) - \sum_{s=1}^{S} \frac{n_s}{n} \left[ \log n_s - \psi(n_s+1) \right], \]

where $n = \sum_s n_s$, and $\psi(\cdot)$ is the standard digamma function. By marginalizing 
out $\Theta_{s,a}$, the predictive probability is given by $p_\alpha(s'|s,a) = \frac{\alpha_{s,a,s'}}{\alpha_{s,a}}$, and $\triangleleft$ is the 
operator³ such that $\alpha \triangleleft (s,a,s')$ is the same as $\alpha$, except that the entry indexed 
by $(s,a,s')$ is increased by 1. Similarly, for a given policy $\pi$, the curiosity Q-value 
can be written as 
(s, a) = g (as,a) J2 ("'I''*') €%<{s.a.s') «') • 
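
As a rough illustration of the Dirichlet quantities used in this section, the sketch below evaluates the closed form of $g$ and the predictive probability for a single state-action pair. The digamma routine is a simple stand-in (recurrence plus a truncated asymptotic series, accurate to roughly $10^{-8}$ for the inputs used here) written for this example; in practice a library implementation could be used instead. All function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    y = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return acc + math.log(x) - 0.5 / x - y * (1 / 12 - y * (1 / 120 - y / 252))


def dirichlet_gain(counts):
    """g([n_1, ..., n_S]): expected information gain of one more observation."""
    n = sum(counts)
    return math.log(n) - digamma(n + 1) - sum(
        (k / n) * (math.log(k) - digamma(k + 1)) for k in counts
    )


def predictive(counts):
    """Predictive probabilities n_s / n obtained by marginalizing out Theta."""
    n = sum(counts)
    return [k / n for k in counts]


g_fresh = dirichlet_gain([1, 1])       # Beta(1, 1): nothing observed yet
g_later = dirichlet_gain([10, 10])     # after many observations
p_next = predictive([3, 1])
```

For a Beta(1,1) prior the gain equals $\log 2 - \frac{1}{2}$ nats, and it shrinks as the counts grow, in line with the decay discussed later in this section.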



4.1 Curiosity Q-value in Infinite Time Horizon 

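In this subsection we extend the definition of curiosity Q-value to infinite horizon. 
We show that a) the limit $\lim_{\tau\to\infty} q^{\gamma,\tau}_{\pi,\alpha}$ exists, b) the limit $\lim_{\tau\to\infty} q^{\gamma,\tau}_\alpha$ exists, 
and c) the limit is the solution of the infinite recursion. 

Proposition 2. $q^\gamma_{\pi,\alpha}(s,a) = \lim_{\tau\to\infty} q^{\gamma,\tau}_{\pi,\alpha}(s,a)$ exists for any $\pi$, $\alpha$, $s$, $a$, and 
$\gamma \in [0,1)$. Moreover, the convergence is uniform with respect to $(s,a)$ in the 
sense that 

\[ 0 \le q^\gamma_{\pi,\alpha}(s,a) - q^{\gamma,\tau}_{\pi,\alpha}(s,a) \le \frac{\gamma^\tau g_\alpha}{1-\gamma}, \quad \forall s, a, \]

where $g_\alpha = \max_s \max_a g(\alpha_{s,a})$. 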



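Proof. Rewrite the result in Lemma 1 in this context: 

\begin{align*}
q^{\gamma,\tau+1}_{\pi,\alpha}(s,a) - q^{\gamma,\tau}_{\pi,\alpha}(s,a) &= \gamma^\tau\, \mathbb{E}_{oa_1 \cdots o_{\tau-1}a_\tau|ha}\, g(a_\tau\|h \cdots o_{\tau-1}) \\
&= \gamma^\tau\, \mathbb{E}_{s_1a_2s_2 \cdots s_\tau a_{\tau+1}|sa}\, g(\alpha'_{s_\tau,a_{\tau+1}}),
\end{align*}

where 

\[ \alpha' = \alpha \triangleleft (s,a,s_1) \triangleleft (s_1,a_2,s_2) \triangleleft \cdots \triangleleft (s_{\tau-1},a_\tau,s_\tau). \]

² We use $\alpha_{s,a}$ both for the vector $[\alpha_{s,a,1}, \ldots, \alpha_{s,a,S}]$ and the number $\sum_{s'} \alpha_{s,a,s'}$. The 
meaning should be clear from the context. 
³ Assume that the operator $\triangleleft$ is right associative, so one can write $\alpha \triangleleft (s_1,a_2,s_2) \triangleleft 
(s_2,a_3,s_3) \cdots$, or simply $\alpha \triangleleft (s_1a_2s_2a_3s_3 \cdots)$. 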



Because $g(\alpha'_{s,a})$ depends only on the transitions when performing $a$ at $s$, and 
all such transitions are exchangeable since they are assumed to be i.i.d. when 
$\Theta_{s,a}$ is given, one can rewrite the expectation in the following form: 

\[ \mathbb{E}_{s} \mathbb{E}_{a} \mathbb{E}_{n} \mathbb{E}_{x_1} \cdots \mathbb{E}_{x_n}\, g(\alpha'_{s,a}). \]

The first and second expectations are taken over the possible final state-action 
pairs $(s_\tau, a_{\tau+1})$, from which $g(\alpha'_{s_\tau,a_{\tau+1}})$ is computed. Once $(s,a)$ is fixed, the 
third expectation is taken over the number of times $n$ that the $(s,a)$-pair appears in the 
trajectory $sas_1a_2 \cdots s_\tau$, i.e., the number of times that a transition starting from $s$ with 
action $a$ occurs. The rest of the expectations are over the $n$ destinations of the transitions, 
denoted as $x_1, \ldots, x_n$. By definition, $\mathrm{Dir}(\alpha'_{s,a})$ is the posterior distribution after 
seeing $x_1, \ldots, x_n$, and $g(\alpha'_{s,a})$ is the expected information gain of seeing the 
outcome of the $(n+1)$-th transition, which we denote $X_{n+1}$, thus 

\[ g(\alpha'_{s,a}) = I(\Theta_{s,a}; X_{n+1}|x_1, \ldots, x_n), \]

and, taking the expectation over $x_1, \ldots, x_n$, 

\[ \mathbb{E}_{x_1 \cdots x_n}\, g(\alpha'_{s,a}) = I(\Theta_{s,a}; X_{n+1}|X_1, \ldots, X_n). \]

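Note that $X_1, \ldots, X_{n+1}$ are i.i.d. given $\Theta_{s,a}$, therefore 

\begin{align*}
I(\Theta_{s,a}; X_{n+1}|X_1, \ldots, X_n)
&= I(\Theta_{s,a}; X_1, \ldots, X_{n+1}) - I(\Theta_{s,a}; X_1, \ldots, X_n) \\
&= H(X_1, \ldots, X_{n+1}) - \sum_{i=1}^{n+1} H(X_i|\Theta_{s,a}) - H(X_1, \ldots, X_n) + \sum_{i=1}^{n} H(X_i|\Theta_{s,a}) \\
&= H(X_{n+1}|X_1, \ldots, X_n) - H(X_{n+1}|\Theta_{s,a}) \\
&\le H(X_{n+1}) - H(X_{n+1}|\Theta_{s,a}) \\
&= I(\Theta_{s,a}; X_{n+1}) \\
&= I(\Theta_{s,a}; X_1).
\end{align*}

This means that $I(\Theta_{s,a}; X_{n+1}|X_1, \ldots, X_n)$ is upper bounded by $I(\Theta_{s,a}; X_1)$, 
which is the expected information gain of seeing the outcome of the transition 
for the first time. By definition $I(\Theta_{s,a}; X_1) = g(\alpha_{s,a})$, and it follows that 

\[ \mathbb{E}_{x_1 \cdots x_n}\, g(\alpha'_{s,a}) \le g(\alpha_{s,a}). \]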



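Therefore, 

\begin{align*}
q^{\gamma,\tau+1}_{\pi,\alpha}(s,a) - q^{\gamma,\tau}_{\pi,\alpha}(s,a) &\le \gamma^\tau\, \mathbb{E}_s \mathbb{E}_a\, g(\alpha_{s,a}) \\
&\le \gamma^\tau \max_s \max_a g(\alpha_{s,a}) = \gamma^\tau g_\alpha.
\end{align*}

Since $g_\alpha$ depends on $\alpha$ only, for any $T > \tau$, 

\[ 0 \le q^{\gamma,T}_{\pi,\alpha}(s,a) - q^{\gamma,\tau}_{\pi,\alpha}(s,a) \le g_\alpha \sum_{t=\tau}^{T-1} \gamma^t \le \frac{\gamma^\tau g_\alpha}{1-\gamma}. \]

This means that $q^{\gamma,\tau}_{\pi,\alpha}(s,a)$ is a Cauchy sequence with respect to $\tau$, thus 
$\lim_{\tau\to\infty} q^{\gamma,\tau}_{\pi,\alpha}(s,a)$ exists. Also note that the convergence is uniform since $g_\alpha$ 
does not depend on $(s,a)$. 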



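Proposition 3. $q^\gamma_\alpha(s,a) = \lim_{\tau\to\infty} q^{\gamma,\tau}_\alpha(s,a)$ exists for any $\alpha$, $s$, $a$, and $\gamma \in 
[0,1)$. Also the convergence is uniform in the sense that 

\[ 0 \le q^\gamma_\alpha(s,a) - q^{\gamma,\tau}_\alpha(s,a) \le \frac{\gamma^\tau g_\alpha}{1-\gamma}. \]

Proof. Rewrite the result in Lemma 2: 

\[ q^{\gamma,\tau+1}_\alpha(s,a) - q^{\gamma,\tau}_\alpha(s,a) \le \gamma^\tau\, \mathbb{E}_{s_1|sa} \max_{a_2} \mathbb{E}_{s_2|s_1a_2,\, \alpha \triangleleft (sas_1)} \cdots \mathbb{E}_{s_\tau|s_{\tau-1}a_\tau,\, \alpha \triangleleft (sas_1 \cdots s_{\tau-1})} \max_{a_{\tau+1}} g(\alpha'_{s_\tau,a_{\tau+1}}). \]

Since the max operator is only over actions, the proof in the previous proposition 
still holds, so 

\[ q^{\gamma,\tau+1}_\alpha(s,a) - q^{\gamma,\tau}_\alpha(s,a) \le \gamma^\tau g_\alpha, \]

and the result follows. 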

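The next proposition shows that $q^\gamma_\alpha$ is the solution to the infinite recursion. 
Proposition 4. $q^\gamma_\alpha$ is the solution to the recursion 

\[ q^\gamma_\alpha(s,a) = g(\alpha_{s,a}) + \gamma \sum_{s'} p_\alpha(s'|s,a) \max_{a'} q^\gamma_{\alpha \triangleleft (s,a,s')}(s',a'), \]

and for any other policy $\pi$, $q^\gamma_\alpha(s,a) \ge q^\gamma_{\pi,\alpha}(s,a)$. 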



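Proof. To see that $q^\gamma_\alpha$ is the solution, taking any $\varepsilon > 0$, one can find a $\tau$ such that 

\[ \left| q^{\gamma,\tau+1}_\alpha(s,a) - q^\gamma_\alpha(s,a) \right| < \frac{\varepsilon}{2}, \quad \text{and} \quad \left| q^{\gamma,\tau}_{\alpha \triangleleft (s,a,s')}(s',a') - q^\gamma_{\alpha \triangleleft (s,a,s')}(s',a') \right| < \frac{\varepsilon}{2} \]

for any $(s,a,s')$ and $a'$, thanks to the fact that there are only a finite number of 
$(s,a,s')$ triples, and the convergence from $q^{\gamma,\tau}$ to $q^\gamma$ is uniform. It follows that 

\[ \left| q^\gamma_\alpha(s,a) - g(\alpha_{s,a}) - \gamma \sum_{s'} p_\alpha(s'|s,a) \max_{a'} q^\gamma_{\alpha \triangleleft (s,a,s')}(s',a') \right| < \varepsilon. \]

Since $\varepsilon$ is chosen arbitrarily, $q^\gamma_\alpha(s,a)$ must be the solution of the infinite recursion. 

The fact that $q^\gamma_\alpha(s,a) \ge q^\gamma_{\pi,\alpha}(s,a)$ follows from the fact that $q^{\gamma,\tau}_\alpha$ and $q^{\gamma,\tau}_{\pi,\alpha}$ 
are monotonically increasing in $\tau$ (by Lemma 1), and $q^{\gamma,\tau}_\alpha \ge q^{\gamma,\tau}_{\pi,\alpha}$ for any given 
$\tau$ and $\pi$. 

The propositions above guarantee the existence and optimality of $q^\gamma_\alpha$, and 
the following discussion focuses on $q^\gamma_\alpha$. We drop the superscript $\gamma$ in the 
rest of this section. 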



4.2 Approximation through Dynamic Programming 

The optimal curiosity Q-value is given by the infinite recursion 

\[ q_\alpha(s,a) = g(\alpha_{s,a}) + \gamma \sum_{s'} \frac{\alpha_{s,a,s'}}{\alpha_{s,a}} \max_{a'} q_{\alpha \triangleleft (s,a,s')}(s',a'). \tag{5} \]

It is impossible to solve this equation directly. A natural idea is to approximate 
this infinite recursion by solving at each time step the following Bellman 
equation, 

\[ \tilde{q}_\alpha(s,a) = g(\alpha_{s,a}) + \gamma \sum_{s'} \frac{\alpha_{s,a,s'}}{\alpha_{s,a}} \max_{a'} \tilde{q}_\alpha(s',a'). \tag{6} \]

The Bellman equation can be solved by dynamic programming in time polynomial 
in $S$ and $A$. The algorithm is given below. 



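CuriosityExploreDP($\alpha$, $\gamma$, $s$, $T$) 

Input: prior $\alpha$, discount factor $\gamma$, initial state $s$, number of steps $T$. 
Output: posterior stored in $\alpha$ 

$G_{s,a} \leftarrow g(\alpha_{s,a})$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$ 
$P_{s,a,s'} \leftarrow \alpha_{s,a,s'} / \alpha_{s,a}$ for all $(s,a,s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ 
for $t$ in 1 to $T$ 
    $\pi \leftarrow$ PolicyIteration($G$, $P$, $\gamma$) 
    $a \leftarrow \pi_s$ 
    $s' \leftarrow$ NextState($a$) 
    $\alpha_{s,a,s'} \leftarrow \alpha_{s,a,s'} + 1$ 
    $G_{s,a} \leftarrow g(\alpha_{s,a})$ 
    for $s''$ in 1 to $S$ 
        $P_{s,a,s''} \leftarrow \alpha_{s,a,s''} / \alpha_{s,a}$ 
    end for 
    $s \leftarrow s'$ 
end for 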

A surprising fact is that when $\alpha$ is large, $\tilde{q}_\alpha$ is in fact a very good approximation 
of $q_\alpha$, which is the central result in this section. We start by investigating 
the properties of the gain $g(\alpha_{s,a})$. 

4.3 Properties of Expected Information Gain in Dirichlet Case 

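The expected information gain of a Dirichlet distribution $\mathrm{Dir}(n_1, \ldots, n_S)$ is 
given by 

\[ g(n) = \log n - \psi(n+1) - \sum_s \frac{n_s}{n} \left[ \log n_s - \psi(n_s+1) \right]. \]

Define 

\begin{align*}
f(x) &= x \left[ \psi(x+1) - \log x \right] \\
&= 1 - x \left[ \log x - \psi(x) \right],
\end{align*}

then 

\[ g(n) = \frac{1}{n} \left[ \sum_s f(n_s) - f(n) \right]. \]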

s 

The following important properties have been proved by Alzer in [1]. 
Theorem 1. $f$ has the following properties⁴: 
a) $\lim_{x\to 0} f(x) = 0$, $\lim_{x\to\infty} f(x) = \frac{1}{2}$. 
b) $f$ is strictly completely monotonic, in the sense that 

\[ (-1)^{n+1} \frac{d^n f(x)}{dx^n} > 0, \quad n = 1, 2, \ldots \]

⁴ Alzer's original paper considers the function $x(\log x - \psi(x)) - 1 = -f(x)$. Here the 
statements are modified accordingly. 

In particular, Theorem 1 shows that $f$ is strictly monotonically increasing, 
and also strictly concave. The following lemma summarizes the properties of 
$f$ used in this paper. 



Lemma 3. Define $\delta_m(x) = f(x+m) - f(x)$ for $m > 0$. Then 

a) $f$ is sub-additive, i.e., $f(x) + f(y) > f(x+y)$ for $x, y > 0$. 

b) $\delta_m(x)$ is monotonically decreasing on $(0, \infty)$. 

c) $0 < x\delta_m(x) < \frac{m}{2e}$ for $x \in (0, \infty)$. 



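Proof. a) Note that $g(n)$ is mutual information, and the unknown observation 
depends on the parameters of the distribution, therefore $g(n) > 0$, and 

\[ 0 < g([x,y]) = \frac{1}{x+y} \left[ f(x) + f(y) - f(x+y) \right]. \]

b) Note that 

\[ \delta_m(x) = \int_0^m f'(x+s)\, ds, \]

and the result follows from $f''(x) < 0$. 

c) Clearly, $x\delta_m(x) > 0$ because $f$ is strictly increasing. From the mean 
value theorem, there is some $\delta \in (0,m)$, such that 

\begin{align*}
x\delta_m(x) &= x \left[ f(x+m) - f(x) \right] \\
&= mxf'(x+\delta) \\
&= mxf'(x) + mx \left[ f'(x+\delta) - f'(x) \right] \\
&< mxf'(x).
\end{align*}

The inequality is because $f$ is strictly concave. 
From [1], 

\[ f(x) = 1 - x \int_0^\infty \phi(t)\, e^{-tx}\, dt, \]

where 

\[ \phi(t) = \frac{1}{1 - e^{-t}} - \frac{1}{t} \]

is strictly increasing, with $\lim_{t\to 0} \phi(t) = \frac{1}{2}$, $\lim_{t\to\infty} \phi(t) = 1$. Therefore, 

\begin{align*}
xf'(x) &= x \int_0^\infty \phi(t)\, e^{-tx} (xt - 1)\, dt \\
&= x^2 \int_0^{1/x} \phi(t)\, e^{-tx} \left( t - \frac{1}{x} \right) dt + x^2 \int_{1/x}^\infty \phi(t)\, e^{-tx} \left( t - \frac{1}{x} \right) dt \\
&< x^2 \phi(0) \int_0^{1/x} e^{-tx} \left( t - \frac{1}{x} \right) dt + x^2 \phi(\infty) \int_{1/x}^\infty e^{-tx} \left( t - \frac{1}{x} \right) dt.
\end{align*}

Note that 

\[ \int_0^{1/x} e^{-tx} \left( t - \frac{1}{x} \right) dt = -\frac{1}{e x^2}, \]

and 

\[ \int_{1/x}^\infty e^{-tx} \left( t - \frac{1}{x} \right) dt = \frac{1}{e x^2}, \]

it follows that 

\[ xf'(x) < -\frac{1}{2e} + \frac{1}{e} = \frac{1}{2e}. \]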

The properties of $f$ guarantee that $g(n)$ decreases at the rate of $\frac{1}{n}$. The result 
is formulated in the following lemma. 

Lemma 4. Let Dir [n\, • • • , n^) and Dir {n\, • • • , n^) &e the prior and the pos- 
terior distribution, such that n* = n° + t. Let s* = argmaxj, Then 



2n 



< 9 (n*) < 



S-1 
2n 



Proof. The upper bound is because $0 < f(x) < \frac{1}{2}$ and $f$ is increasing, thus 

\[ \sum_s f(n^t_s) - f(n^t) = f(n^t_{s^*}) - f(n^t) + \sum_{s \ne s^*} f(n^t_s) < \frac{S-1}{2}. \]

The lower bound follows from the fact that $f(x+m) - f(x)$ is decreasing. We 
show that the trajectory minimizing $g(n^t)$ is the one such that all $t$ observations 
are equal to $s^*$. To see this, let $m_s$ be the number of times observing $s \ne s^*$, then 

/ (n° + ms)+f (nO. + m,.) = / (n°) + / (n^. + m,. + m,) 

+ /(n° + m,)-/(nO) 
- (/ (n^. + m,. +ms)~f (n°. + m,.)) . 

Note that + mg* > n", so 

/ (n° + m,) + / + m«.) > / (n°) + / (n°. + m,. + m,) . 
Now assume the observations are all s*, from sub-additivity, 

E/K)-/K) 

s 

= E/K) + /K-+*)-/K+*) 



s^s* 



= E/K)-/ 




+ /(n°.+t)-/(nO+t) 



>E/K)-/ E-° 



A brief remark: The bounds hold irrespective of the data generating process, 
namely, they hold for any sequence of observations, including sequences with zero 
probability. 

The following lemma bounds the variation of the expected information gain 
when one single observation is added. 



Lemma 5. Let n = [ni, • • • ,ns] and n' = [ni, • • • , n^-i, Ug + l,ns+i, • • • ,ns], 
then 



Proof. Without loss of generality let s = 1. Note that 



ni 
n 



[g {n) - g {n)\ 



ni 



1 



n I n + 1 



_ ni f 
~ n \ 



^/K) + /(ni + l)-/(n + l) 



E./K) /(n+l) ^ /(ni + l)-/(ni) ^ / (n) E./K) 
n+1 n+l n+l n n 



ni f /(n + l) E./K) I /(ni + l)-/(ni) ^ /(n) 

n\ n + l n(n + l) n + l n 

nif /(n + l) ng(n) + /(n) ^ / (m + 1) - / (m) ^ / (n) 
n\n+l n(n+l) n+l n 



ni 



n(n+ 1) 
1 



{-Si {n)+Si (ni)-5(n)} 



= ^ — \ ni(5i (ni) - — • n^i (n) - — 
n (n + 1) 1 n n 

From the previous lemma, 



so 



thus 



0<a;<5i (a;)< ^, < f (x) < ^, 



S ni S ni , . , 1 

< 7^ < —[9{n')-9{n)] < 



2in? n 2n2 n 

^l.(n')-.(n)|<^. 



2n2' 



4.4 Bounding the Difference Between $q_\alpha$ and $\tilde{q}_\alpha$ 

In this subsection we present the result bounding the difference between $q_\alpha$ and 
$\tilde{q}_\alpha$, without making any assumptions about the environment. Let $c_\alpha = \min_s \min_a \alpha_{s,a}$; 
the main conclusion of this subsection is that 

ka {s,a) - Qa {s,a)\ ~ 4"- 



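Lemma 6. $q_\alpha(s,a) \le \frac{S-1}{2(1-\gamma)c_\alpha}$.



Proof. From Lemma 4, write $K_0 = \frac{S-1}{2}$, then

$$g(\alpha_{s,a}) < \frac{K_0}{\alpha_{s,a}} \le \frac{K_0}{c_\alpha}, \quad \forall s, a.$$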



By definition, 



< Ko ( — + 7 ^- 

<^(l + 7), 

Cry 



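since $c_{\alpha \triangleleft (s,a,s')} \ge c_\alpha$. Repeating the process, it follows that for any $\tau$,

$$q^\tau_\alpha(s,a) \le \frac{K_0}{c_\alpha}\left(1 + \gamma + \cdots + \gamma^{\tau-1}\right) < \frac{K_0}{(1-\gamma)c_\alpha} = \frac{S-1}{2(1-\gamma)c_\alpha},$$

thus

$$q_\alpha(s,a) = \lim_{\tau \to \infty} q^\tau_\alpha(s,a) \le \frac{S-1}{2(1-\gamma)c_\alpha}.$$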



Lemma 7. Lei n = [ni, • • • , ns], afid 
6e S non-negative numbers. Define ps - 
s = 2, • • • ,5*. Then 



[ni + 1,712, •• • ,ns]. Let xi, ■ 



n ' 



n+1 



/or 



Pi 



< 



rj — ^ 



Proof. Simple derivation gives 

Pi iP's-Ps)Xs 



Pi ■ \p's-Ps\ 



If s = 1, 



Pi ■ \p'i - Pi I ^ ni 
Pi n 



ni + l ni^ 

n+1 n 
"1 



n — ni 1 

— ^ < -. 

n (n + 1) n 



If s ^ 1, 



Therefore, 



and 



111 



< 



n{n+l) n 



Pi ■ \p's-Ps\ ^ 1 

max — < — , 

s Ps n 




Lemma 8. For any a, , , there is some constant K depending on S and 7 
only, such that 

K 



max |gc»<(st,ot,a) {s,a) -qa{s,a)\ < 



,2 ■ 



Proof. First change the notations. Let Sq = s\ ao = . Also let = a, 
= a < (s^, a^, s). The result to prove becomes 



^T^^maxlg^i (si,ai) - q^i (si,ai)| 



Consider the finite time horizon approximations of g^i and g^i , namely 
and q^i . With a little abuse of notation, we drop the superscript 7 in this 
proof. Note that this shall not be confused with the finite time horizon curiosity 
Q- values without discount. 

For T = 2, consider the following inequality: 



-«imax|g2, ^si,ai)-ql. (si,ai)| 
<^ ax|5K,„J-<?«,aJ| 



+ 7^=Mimax 



S2 



ai ^siai 



+ ^^p^ max V max | (s2, as) - 9^2 (s2, as) | . 



Oi ci\ — ft 



Here = Z?-'^ <i (siaiS2), = < (siaiSs). Note that the error between 
and g^i has been decomposed into three terms. The first term captures the 



difference between the immediate information gain, the second term captures
the error between transition probabilities, and the third term is of the same
form as the left side of the inequality, except that $\tau$ is decreased by 1.


To simplify the notation, let $F^t$ be the operator






IP [ J — 2^ t-i "l*^-^ i J ■ 




For fixed r, let 






St = {g 






+ 7 


/ /3* a* \ 

at ^t maxg^,,a^*+i,a*+i) 




and 






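One can write

$$F^1\phi_1 \le F^1\delta_1 + \gamma F^1 F^2 \phi_2.$$

Repeating this process for general $\tau$, it follows that

$$\begin{aligned}
F^1\phi_1 &\le F^1\delta_1 + \gamma F^1 F^2 \phi_2 \\
&\le F^1\delta_1 + \gamma F^1\left[ F^2\delta_2 + \gamma F^2 F^3 \phi_3 \right] \\
&= F^1\delta_1 + \gamma F^1 F^2 \delta_2 + \gamma^2 F^1 F^2 F^3 \phi_3 \\
&\le F^1\delta_1 + \gamma F^1 F^2 \delta_2 + \cdots + \gamma^{t-1} F^1 \cdots F^t \delta_t + \cdots + \gamma^{\tau-2} F^1 \cdots F^{\tau-1} \delta_{\tau-1} \\
&\qquad + \gamma^{\tau-1} F^1 \cdots F^\tau \phi_\tau.
\end{aligned}$$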




Now look at a particular term in the inequality above, for example. 




■¥'6t = y max - • .y max 5*. 




Note that if (st,at) ^ (s^, a^) then St = 0, since /3* and a* differ only 


in the entry 


indexed by (s\a 


' , si). The following discussion assumes that {st, at 


= (st,at). 


From Lemmajsj let Ki = ^, then 











From Lemmas 6 and 7, there is some $K_2$ depending only on $S$ and $\gamma$, such that




statst + l 



maxg^t+i {st+i,at+i) 



< 



st+1 



maxg^t+*i (s4+i,at+i) 



< 



K2 



where c^t = min^ min^ al ^. In combination, there is some Kq such that 

Ot < i ■ 



The next step is tricky: assume that the policy is given, say, it is already
the policy maximizing $F^1 \cdots F^t \delta_t$, so that each action is a deterministic function of the
prior and the previous history. Consider a trajectory $s_0 a_0 s_1 a_1 \cdots s_t a_t$; the
predictive probability of seeing such a trajectory is given by



p{siai ■ ■ ■ Stat\soao) 



soao 



st-iat-ist 
«St_iat„i 



Again, if {st,at) ^ (s^, a^), then p (siOi • • • Stat\soao) St = 0. Otherwise, 
p{siai ■ ■ ■ stat\soao) St 



a2 

"siaiS2 




ist 








t-1 


-1 


V ai 


"siaiS2 




iSt 








t-1 


-1 






"siaiS2 




iSt 








t-1 


-1 






"siaiS2 




iSt 








t-1 


-1 







Note that 



a 
a 



2 

siaiS2 
2 



< 



< 



< 



t-1 



p(s2a2 • • • stfltlsiai) 



is the probability of seeing the trajectory 5202 • • • SfOt, when the agent assumes 
prior < (so,ao,si) = a^, and follows the same policy starting from (si,ai). 
Clearly, 

X! ■ ' ■ (^202 ■ ■ • Stfltlsifli) = 1, 



which leads to 

Fi • • • ¥*6t ■ (^1^1 ■ ■ ■ «*«*l'Soao) St 

Si S2 St 

- X! ^ X! " ■ ^^202 • • • stat\siai) 

Sl S2 St 



Putting the equation back, and note that c^i = is a constant on a, one 

has 

FVi < ]F^(5i + -f¥VS2 + ■■■+ j^^V ■ ■ ■ ¥*6t + ■■■+ 7^"¥i • • • ¥^-^Sr-i 

+ Y-^¥^ ■■■¥^(j>r 

< ^l+^r-iFi...F-0,. 

1 - 7 Ca 

From Lemma 6, since the curiosity Q-values are bounded, there is some constant such
that

</>r = {qpt {st,at) - qit {st,at)\ 

< \qlt {st,at) \ + \qit {st,at)\ 

< \qpt {st,at)\ + \qat {st,at)\ 

thus 

FV.<^^+7^-^. 

1 - 7 Ca Ca 

Letting $\tau \to \infty$, one has

y] T"^' max \qj3i (si,ai) - q^i (si,ai)| < ^, 



where X = is a constant depending on S and 7 only. 



The central result of this subsection is given by the following proposition.



Proposition 5. There is some $K > 0$ depending on $S$ and $\gamma$ only, such that



\qa (s,a) - (s,a)| < ^. 



Proof. Write Eq. 5 in the following form

Qa {s,a) = g (as^a) +7 "^^^^ maxQa {s',a') 

+ 7 > max qa<,{s.a.s') (s , a ) - max (s , a ) 



The last term is bounded by 



= y max ga<i(s,a,s') (s , a ) - max (s , a j 

, Ois.a a' ^ ' a' 

s 

< V max ga<i(s,a,s') (s , a ) - (s , a ) • 



Apply LemmajTj it follows that there is some constant Kq depending on 5* and 
7 only, such that 



max \ qa<{s,a,s') [s ,a) - qa{s ,a)\ < 



„2 • 



Therefore S < ^. 

Now compare qa and qa'. 

max max \ qa (s, a) — qa (s, a)| 

s a 

ryX^ (max get (s', a') — maxq-Q, (s', a') ) + 7(5 



max max 

s a 



< 7maxmax|ga (s,a) - (s,a)| + 

ci 



s a 



Therefore 



max max \qa [s, a) — qa [s, a)\ < • 

■ - 1-7 



s a 



Letting K = completes the proof. 



4.5 Quality of the Approximation in Connected Markovian 
Environment 



Proposition 5 guarantees that the difference between the curiosity Q-value and its DP
approximation decreases at the rate of $c_\alpha^{-2}$. However, this alone is not enough to guarantee
that the DP approximation converges to the curiosity Q-value when the agent operates in the
environment. For example, consider an environment that consists of two connected components.
In this case, $c_\alpha$ is bounded from above, since $\alpha_{s,a}$ never increases in one of the
connected components. Here we make the following assumption:

Assumption III The environment is finite Markovian with dynamics $p(s'|s,a)$,
and the Markov chain with transition kernel

$$P(s'|s) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} p(s'|s,a)$$

is irreducible.
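The irreducibility condition in this assumption can be checked mechanically for a given finite MDP. The sketch below is one way to do so, assuming the dynamics are available as a nested list `p[s][a][s2]` (a hypothetical layout) and reading the kernel as the uniform average over actions. Irreducibility then amounts to strong connectivity of the graph of positive-probability transitions, which is tested here with a forward and a backward breadth-first search from state 0. The function names are illustrative.

```python
from collections import deque


def averaged_kernel(p):
    """Average p[s][a][s2] uniformly over actions to get P[s][s2]."""
    n_states = len(p)
    return [
        [sum(p[s][a][s2] for a in range(len(p[s]))) / len(p[s])
         for s2 in range(n_states)]
        for s in range(n_states)
    ]


def reachable(kernel, start, forward=True):
    """States reachable from `start` along (or against) positive-probability edges."""
    n_states = len(kernel)
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in range(n_states):
            weight = kernel[u][v] if forward else kernel[v][u]
            if weight > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_irreducible(kernel):
    """A finite chain is irreducible iff its transition graph is strongly connected."""
    n_states = len(kernel)
    return (len(reachable(kernel, 0, True)) == n_states
            and len(reachable(kernel, 0, False)) == n_states)
```

For example, a three-state ring with a single action passes the check, while an environment made of two separate two-state loops fails it, matching the two-component counterexample discussed above.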



The first half of the assumption ensures that the ratio $\alpha_{s,a,s'}/\alpha_{s,a}$ converges to $p(s'|s,a)$
when $\alpha_{s,a}$ goes to infinity, by the Law of Large Numbers. The second half of the
assumption implies that it is always possible to navigate from one state to another
with positive probability of success. Therefore, if some $g(\alpha_{s,a})$ is large, the
information is guaranteed to propagate to all the states. Under this assumption,
we prove in this section that, as $t \to \infty$, the ratio between the curiosity Q-value and
its DP approximation converges to 1 for every $(s,a)$; namely, the curiosity Q-value and
the DP approximation become arbitrarily close over time.

The proof is unwrapped in three steps. 

Lemma 9. Assume IV), and the agent chooses the action greedily with respect 
to qat , where a* is the posterior after t time steps. Then for any s,a, 



$$\lim_{t \to \infty} \alpha^t_{s,a} = \infty, \quad \text{a.s.}$$



Proof. Note that $\alpha^t_{s,a}$ is non-decreasing in $t$, and can only increase by one at a time.
Therefore, $\lim_{t\to\infty} \alpha^t_{s,a} < \infty$ implies that there are some $T_{s,a}$ and $c_{s,a}$
such that $\alpha^t_{s,a} = c_{s,a}$ for all $t > T_{s,a}$.

The complement of $\lim_{t\to\infty} \alpha^t_{s,a} = \infty$ for all $(s,a)$ is that $\exists A \subset \mathcal{S} \times \mathcal{A}$, $A \neq \emptyset$,
and $\exists T_{s,a}, c_{s,a}$ for all $(s,a) \in A$, such that $\alpha^t_{s,a} = c_{s,a}$ for all $t > T_{s,a}$. Since there
are only finitely many $(s,a)$, this can be simplified to $\exists A \neq \emptyset$, $\exists T$, $\exists c_{s,a}$, such
that $\alpha^t_{s,a} = c_{s,a}$ for all $t > T$ and $(s,a) \in A$.

Fix $A \neq \emptyset$, $T$ and $c_{s,a}$; we show that the event $\alpha^t_{s,a} = c_{s,a}$ for all $t > T$ and
$(s,a) \in A$ is a null event. Let $\bar{A} = \mathcal{S} \times \mathcal{A} \setminus A$; by definition, $\alpha^t_{s,a} \to \infty$ for all
$(s,a) \in \bar{A}$. Clearly, $\bar{A}$ is not empty. Define

$$\mathcal{S}_I = \{ s \in \mathcal{S} : \exists a, a' \text{ such that } (s,a) \in A, (s,a') \in \bar{A} \}.$$

Namely, $\mathcal{S}_I$ is the 'boundary' between $A$ and $\bar{A}$.

The first step is to show $\mathcal{S}_I \neq \emptyset$ if $A \neq \emptyset$, or more precisely, that the event $\mathcal{S}_I = \emptyset$
and $A \neq \emptyset$ is null. Assume $\mathcal{S}_I = \emptyset$ and $A \neq \emptyset$; then $A$ must satisfy that if
$(s,a) \in A$ for some $a$, then $(s,a) \in A$ for all $a$. Let $\mathcal{S}_A \subset \mathcal{S}$ be the set of $s$ such
that $(s,a) \in A$. Clearly, once some $s \in \mathcal{S}_A$ is reached, any action chosen causes $A$
to be visited, which can only happen finitely many times. This implies that for
any $s \in \mathcal{S}_A$, any state-action pair $(s',a')$ such that $p(s|s',a') > 0$ can only be
visited a finite number of times almost surely, because the probability of sampling
from $p(\cdot|s',a')$ infinitely many times but obtaining $s$ only finitely many times is
zero. From Assumption IV), for any $\mathcal{S}_A \neq \mathcal{S}$, there is always some $(s',a')$ such
that $s' \notin \mathcal{S}_A$ and $p(s|s',a') > 0$, so $(s',a')$ can only be visited finitely many
times. By definition $(s',a') \in A$, which contradicts the fact that $s' \notin \mathcal{S}_A$.

Next we show that for at least one $s \in \mathcal{S}_I$, following the optimal strategy
leads to some $(s,a) \in A$ being visited. For $t > T$, define

q{s,a) = r{s,a) p{s'\s,a)maxq{s',a') , 

^— ' a' 

with 



( p{s'\s,a),i{ { 



p (s'|s, a) , if (s, a) ^ yl 
) Gyl 



and 

JO, if {s,a)^A 



9{al,a) 1 if {s,a) e yl ' 

Clearly, p and r do not depend on t, and q is the unique optimal solution. Now 
let (s^, a^) G yl be the pair such that s G Si, and 

q(s^,a^)= max q(s,a). 

It can be seen that for any a' such that (s^,a') ^ yl, g(s^,a') < 7g(s^,a^). 
The reason is the following: Performing a' leads to zero immediate reward since 
(^s^,a') ^ yl. Let s" be the result of the transition, then either s" G Si, so 
maxa" q {s" , a") < q(s^ , a^) , or s" is some other state such that (s", a") ^ A for 
all a". (Note that s" cannot be a state such that {s",a") G yl for all a".) In the 
latter case, since s" is only connected to states in yl through <S/, it must be that 



since at least one more step must be made to reach Si first. Taking into account 
the discount, it follows that 

q {s\a^) - q {sla') > {1 - -f) q {s\a^) . 

Replace q with q^t leads to 

g„t (s^a'f) (s^a') > (1 - 7) g (s^ a''') 

+ qat {s\a^) -q{s\a'') + g^t {s\a') -q{s\a') 

From the initial assumption, when t > T, <^s^,a^) is never visited, also, the 
action is chosen greedily with respect to q^t . This implies that at least for one 
a' such that (s^,a') ^ A, 

q^t {s\a)) - q^t {s\a') < 0, 

> (1 - j)q{s\a^) + cjat {s\a^) - q {s\a^) + q^t {s\a) - q {s\d) 
> (1 — 7) q (s^, a^) — 2 max max Iq^t (s, a) — q (s, a) \ , 



or 



s a 



which leads to 

\ — 7 

max max |(jQ,t (s, a) — q (s, a)\ > — - — q (s' , a') 

Note that 

\qat {s,a) - q{s,a)\ < |g (aL) -r{s,a)\ 



-7 



E 



-p{s'\s,a] 



max^ct* (s', a') 



Ep (s'|s, a) max {q^t (s', a') - g (s, a)| , 



so 



max max I^q.* (s, a) — q (s, a)| 

s a 
1 



< 



+ 



Yz:^ kK^a) -r{s,a)\ 



max max 

1 — 7 s a 



E 



a; 



at 



'P(s'|s,a) 



maxg^t* (s', a') 



From Lemma|4| 



(as,a) -''(s,a)| 



S — 1 

max g(a* ) < — ^ > 0. 



Therefore, there is some T' such that \g (a* o) — (s, a) | < ^^^9 [s^ , a^) for all 
(s, o) ^ yl. Also note that 



qat is',a') < 



S-1 



where Cq, = 



^^^{s,a)eA Cs,a- Let K = 2(i%)^c ' ^^^"^ 



if max max 

s a 

7 



E 



2(1-7) Ca' 

s- 

(1-7) 

p(s'|s,a) 



+ i^^(at,at)
> max max Iq^t (s, a) - q (s, a)
thus



max
s,a,s'



at 



■p(s'|s,a) 



max max 

5 a 



E 



■p(s'|s,a) 



>^^(.t,at), 



for all $t > T'$. This implies that when $t \to \infty$, the empirical ratio $\alpha^t_{s,a,s'}/\alpha^t_{s,a}$ does
not converge to $p(s'|s,a)$, which is a null event because it contradicts the Strong
Law of Large Numbers. This in turn implies that for fixed $A \neq \emptyset$, $T$ and $c_{s,a}$, the
event that $\alpha^t_{s,a} = c_{s,a}$ for all $t > T$ and $(s,a) \in A$ is null.

As the last step, notice that there are only countably many such events, and
since the union of countably many null events is still null, one can conclude that
$\lim_{t\to\infty} \alpha^t_{s,a} = \infty$ for all $(s,a)$ holds almost surely.



The next step is to show that all qa* (s, a) decreases at the rate lower bounded 
by c~^. Let qg^a be the Q-value of performing a at state s, assuming the reward 
is 1 at all states. Namely, qg^a is the solution of the following Bellman equation 



qs,a = I + ^^p{s'\s,a)^qs' ,a' 

s' a' 



Clearly, qs^a > from Assumption IV). Define q = mins^a^s.a- Also, let 

Us, a — n J 



where is the initial a representing the agent's prior, and s* = argmaxs' 
as defined in Lemma|4] Define u = min^^a Ug.a- 

Lemma 10. Assume IV), and let c^t = min^ mina a* then 





lim inf c^tg^ 

t— f oo 


t (s, a) > uq, a.s. 


Proof. Let q^t be the solution to the following Bellman equation: 


ga* (s 


a) = 5 ("s.a) +l^^v{s'\s-:0)^^;^qa* {s',a') . 

s' 


Clearly, for any (s. 


a), 

9 H,a) > 


Us,a ^ 


and because q^t is 


optimal. 






(s,a) > — 

Cat 


-qs,a > —, Vs,a, 


or Catqat (s, a) > uq. 




Fix an e > 0, we show that 






lim inf Catq^t (s, 

f oo 


a) < ug (1 — e) , Vs, a 


is a null event. Ass 


uming liminft_5.oo < 


-a'^a' ^ ^^^(l ~ a-nd following similar 


procedure as in the proof of Lcmmajo 


let K= then 


max max > 

s a ^ 

s' 


a* , 

p(s s,a) 


> max max |g(jt (s, a) — g (s, a)| 

K s a 






- 






holds for infinitely many $t$, which again contradicts the Law of Large Numbers.






Let En — \^ then the union of the 


countably many events 




lim inf Catqa* {s, a) < uq{l — £„) , Vs, a 


is again a null event, therefore 






lim inf Catq^ 

i— f oo 


t (s, a) > uq, a.s. 



Combining Lemmas 9 and 10 produces the following proposition.



Proposition 6. Assume IV), and that the agent acts greedily with respect to 
Qa, then 

lim — 

and there is some K depending only on the dynamics and 7, such that 



- 1 



= 0, a.s. 



lim sup Cq, 

>-oo 



- 1 



< K. 



Proof. Note that 



< 



K 



Cq^ Cq, ([q^ 



where $K$ is given in Proposition 5. Using Lemmas 9 and 10, the result follows trivially.



5 Experiment 

The idea presented in the previous section is illustrated through a simple experi- 
ment. The environment is an MDP consisting of two groups of densely connected 
states (cliques) linked by a long corridor. The agent has two actions allowing it 
to move along the corridor deterministically, whereas the transition probabilities 
inside each clique are randomly generated. The agent assumes Dirichlet priors 
over all transition probabilities, and the goal is to learn the transition model of 
the MDP. In the experiment, each clique consists of 5 states (states 1 to 5 and
states 56 to 60), and the corridor is of length 50 (states 6 to 55). The prior over 
each transition probability is Dir ( , . . . , ) . 

We compare four different algorithms: i) random exploration, where the agent
selects each of the two actions with equal probability at each time step; ii) Q-learning
with the immediate information gain $g(ao\|h)$ as the reward; iii) greedy
exploration, where the agent chooses at each time step the action maximizing
$g(a\|h)$; and iv) a dynamic-programming (DP) approximation of the optimal
Bayesian exploration, where at each time step the agent follows a policy
computed using policy iteration, assuming that the dynamics of the MDP are
given by the current posterior and that the reward is the expected information gain
$g(a\|h)$.
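The DP approximation in iv) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses value iteration in place of the policy iteration mentioned above, takes the posterior-mean transition probabilities as the assumed dynamics, and assumes that the expected information gain of a Dirichlet belief is the expected KL divergence between the one-step posterior and the current belief. The names `counts`, `gamma`, `sweeps`, and all function names are hypothetical.

```python
import math


def digamma(x: float) -> float:
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_info_gain(alpha):
    """Expected KL(posterior || prior) for a Dirichlet belief (assumed form)."""
    total = float(sum(alpha))
    gain = math.log(total) - digamma(total + 1.0)
    for a in alpha:
        gain += (a / total) * (digamma(a + 1.0) - math.log(a))
    return gain


def dp_curiosity_values(counts, gamma=0.9, sweeps=500):
    """Value iteration on the posterior-mean MDP with information gain as reward.

    counts[s][a] holds the positive Dirichlet parameters over next states
    for taking action a in state s. Returns a table q[s][a].
    """
    n_states = len(counts)
    reward = [[expected_info_gain(c) for c in row] for row in counts]
    mean = [[[x / sum(c) for x in c] for c in row] for row in counts]
    q = [[0.0] * len(row) for row in counts]
    for _ in range(sweeps):
        v = [max(row) for row in q]  # greedy state values from the current table
        q = [
            [reward[s][a]
             + gamma * sum(prob * v[s2] for s2, prob in enumerate(mean[s][a]))
             for a in range(len(counts[s]))]
            for s in range(n_states)
        ]
    return q


def greedy_action(q, state):
    """Index of the action with the largest value in the given state."""
    return max(range(len(q[state])), key=lambda a: q[state][a])
```

In use, the agent would recompute the table from its current posterior at each time step, take `greedy_action`, observe the next state, and increment the matching entry of `counts`.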

Fig. 2 shows the typical behavior of the four algorithms. The upper four plots
show how the agent moves in the MDP starting from one clique. Both greedy
exploration and DP approximation move back and forth between the two cliques.
Random exploration has difficulty moving between the two cliques due to the
random walk behavior in the corridor. Q-learning exploration, however, gets
stuck in the initial clique. The reason is that, since the jump on the corridor
is deterministic, the information gain decreases to virtually zero after only several
attempts; therefore the Q-value of jumping into the corridor becomes much
lower than the Q-value of jumping inside the clique. The bottom plot shows how
the cumulative information gain grows over time, and how the DP approximation
clearly outperforms the other algorithms, particularly in the early phase of
exploration.
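The difficulty of random exploration can be made concrete with a back-of-the-envelope calculation. As an idealization that is not part of the experiment's specification, treat the corridor as a symmetric random walk on positions $0, \dots, N$ that is reflected at position 0 and stops on reaching $N$. Writing $E_k$ for the expected number of steps from position $k$, the recurrence $E_k = 1 + \frac{1}{2}(E_{k-1} + E_{k+1})$ with $E_0 = 1 + E_1$ and $E_N = 0$ gives $E_0 = N^2$. The function name below is hypothetical.

```python
def expected_crossing_time(length: int) -> int:
    """Expected steps for a reflected symmetric random walk to travel `length` positions.

    With D_k = E_k - E_{k+1}, the recurrence gives D_0 = 1 and D_k = D_{k-1} + 2,
    so E_0 is the sum of the first `length` odd numbers.
    """
    total = 0
    increment = 1  # D_0
    for _ in range(length):
        total += increment
        increment += 2  # D_k = D_{k-1} + 2
    return total


if __name__ == "__main__":
    print(expected_crossing_time(50))
```

Under this idealization a corridor of length 50 takes on the order of 2500 steps to cross once, a large share of a 4000-step run, which is in line with the random explorer rarely reaching the second clique.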

6 Related Work 

The idea of actively selecting queries to accelerate the learning process has a long
history [2,3,8], and has received a lot of attention in recent decades, primarily in the
context of active learning [5] and artificial curiosity [7]. In particular, measuring
learning progress using KL divergence dates back to the 1950s [5,3]. In 1995 this
was combined with reinforcement learning, with the goal of optimizing future
expected information gain [11]. Others renamed this Bayesian surprise [4].

Our work differs from most previous work in two main points: First, like in
[11], we consider the problem of exploring a dynamic environment, where actions
change the environmental state, while most work on active learning and
Bayesian experiment design focuses on queries that do not affect the environment
[3]. Second, our result is theoretically sound and directly derived from first
principles, in contrast to the more heuristic application [11] of traditional
reinforcement learning to the problem of maximizing expected information gain. We
formulated the concept of curiosity (Q) value, and highlighted the necessity of
balancing immediate information gain and long-term expected information gain
(see Eq. 4). In particular, we pointed out a previously neglected subtlety of using
KL divergence as learning progress.

Conceptually, however, our work is closely connected to artificial curiosity
and intrinsically motivated reinforcement learning [7,10,8] for agents that actively
explore the environment without an external reward signal. In fact, the very
definition of the curiosity (Q) value permits a firm connection between pure
exploration and reinforcement learning.

7 Conclusion 

We have presented the principle of optimal Bayesian exploration in dynamic en- 
vironments, centered around the concept of the curiosity (Q) value. Our work 
provides a theoretically sound foundation for designing more effective explo- 
ration strategies. Based on this result, we establish the optimality of the DP 
approximation of the optimal Bayesian exploration in the MDP case.

Fig. 2. The exploration process of a typical run of 4000 steps. The upper four
plots show the position of the agent between state 1 (the lowest) and 60 (the
highest). The states at the top and the bottom correspond to the two cliques,
and the states in the middle correspond to the corridor. The lowest plot is the
cumulative information gain with respect to the prior.



